Overview of Data
The purpose of this dataset was to help find predictors of diabetes.
The goal is to help medical professionals identify patients with
potential risk factors of diabetes.The dataset contains information on
patient demographics such as age and gender, as well as medical
information including blood glucose levels and hypertension. The
observations in this dataset were obtained from research studies and
healthcare institutions. The dataset was obtained via Kaggle.
Mustafa, T. Z. (2021). Diabetes Prediction Dataset. Kaggle.
https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
The dataset consists of 8 predictor variables for diabetes:
gender - the patient’s gender as male or
female.
age - the age of the patient in years.
hypertension - yes or no, on whether the patient
has hypertension.
heart disease - yes or no, on whether the
patient has heart disease.
smoking history - the patient’s smoking history
indicated as never, no info, current, former or not current.
bmi - the patient’s body mass index
(kg/m**2).
HbA1c level - the patient’s average blood sugar
level over the past 2-3 months (%).
blood glucose level - amount of blood sugar in
the patient’s blood (mg/dL).
#load in dataset
data <-read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')
diabetesd<-data
#change heart disease, diabetes and hypertension to 1 and 0 for easier use in MICE.
data$diabetes <- recode(data$diabetes, "1"="Yes", "0"="No")
data$hypertension <- recode(data$hypertension, "1"="Yes", "0"="No")
data$heart_disease <- recode(data$heart_disease, "1"="Yes", "0"="No")
#plot interaction between diabetes and hypertension.
ggplot(data, aes(x = hypertension, fill = diabetes)) +
geom_bar(position = "dodge") + # Use "stack" for a stacked bar chart
labs (title="Risk of Diabetes by Hypertension", x="Hypertension", y= "Frequency",
fill= "Diabetes")
Comparison of Two Categorical Variables
The purpose of this graph is to show the relationship between having
hypertension and diabetes. For this graph, I changed the values of 1 in
both the diabetes and hypertension variable to “Yes” and the value 0 to
“No.” These integers were meant to represent categorical variables with
1 correlating to being positive for hypertension or diabetes and 0
correlating to being negative to hypertension or diabetes. Based on our
graph, we can see that there is no correlation between having diabetes
and having hypertension, with most patients within our sample having
neither diabetes or hypertension. Thus, hypertension would not be a good
predictor of diabetes within patients.
interact = ggplot(data, aes(x =blood_glucose_level , y = HbA1c_level, group = diabetes, color = diabetes)) +
geom_line(size = 1) + # Lines representing interaction
geom_point(size = 2) + # Points for data
labs(
title = "Interaction of Blood Glucose and HbA1c Levels",
x = "Blood Glucose Level",
y = "HbA1c Level",
color = "Presence of Diabetes"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.text = element_text(size = 12),
legend.position = "top"
)
interact
Comparison of Two Numerical Variables
The two variables being compared here are Blood Glucose Level and
HbA1c Level. HbA1cc is a shortened term for hemoglobin A1c. Hb1Ac level
is a measure of average blood sugar levels over 2-3 months and is
measured in %. Blood glucose level is the amount of blood sugar in a
patients blood at a given time. Our plot shows a correlation between
high levels in HbA1c and blood glucose levels and it’s relationship with
diabetes. This plot suggests patients with high levels of HbA1c and
blood glucose levels are at high risk for diabetes.
#plot of a numerical and categorical variable
boxplot(data$age ~ data$diabetes,
xlab="Presence of Diabetes",
ylab="Patient Age",
col = c("skyblue", "purple"),
main="Visualization of Diabetes by Age",
cex.main = 1.1,
col.main = "navy")
Comparison of a Numerical and Categorical Variables
This chart shows the relationship between age and diabetes diagnosis.
This chart shows that as patients grow older they are at higher risk for
diabetes. This would make age a predictor for diabetes, with the average
age for those without diabetes being 40 years and 60 years for those
with diabetes. However, we can see that we do have a few outliers for
those with diabetes with some patients being diagnosed at 20 or younger.
This could be from the population of people with Type I diabetes which
is usually diagnosed in patients under 20. Our dataset does not
distinguish between type I and type II diabetes.
Conclusion for Comparison of Variables
Diabetes can be predicted using age, blood glucose levels, and HbA1c
levels in patients. However, our results showed hypertension is not a
good predictor of diabetes. Based on our findings, it would be
interesting to look at factors that occur once a patient becomes older
that could lead to diabetes at an older age. As well, diet could be
another factor to look into, such as grams of sugar consumed on a daily
or weekly basis, to see if sugar consumption can predict diabetes.
Another important thing to analyze in future studies would be Type I
versus Type II diabetes, as it could be the reason for some of the
outliers in the boxplot.
Overview of Missing Variables
An important step in data analysis for machine learning is handling
missing values. Missing values in a dataset can occur for many reasons,
but not limited to non-response in surveys, participants leaving the
study, data entry errors and system limitations. If missing values are
not addressed, models can become biased and accuracy of the anaylysis
can decrease. This section examines different imputation strategies to
effectively handle missing values.
The imputation methods we will examine include:
Replacement Imputation for Categorical Features: Mode imputation.
Regression-based Imputation for Numerical Features: Predictive
modeling to estimate missing values.
Multiple Imputation: Advanced methods such as MICE to improve
robustness.
# create random observation ID and replace the corresponding obs with missing
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE)
diabetesd$bmi[ltr.missing.id] <- NA
diabetesd$heart_disease[gpa.missing.id] <- NA
diabetesd$smoking_history[diabetesd$smoking_history == "No Info"] <- NA
Mode Imputation for Categorical Variables
In this method, missing variables are replaced with the most frequent
category of the corresponding variable. Mode imputation ensures
consistency and works well when missing values are randomly distributed.
Below, mode imputation is used on the variable, heart disease, which is
1 if the patient has heart disease and 0 if the patient does not have
heart disease.The variable smoking history had current, former, no info,
never, and ever as possible values. No Info indicates that we do not
have information on the participant’s smoking history and will be
treated as a missing values.
original original2 imputed_modehd imputed_modesh
Min. :0.00000 Length:100000 Min. :0.00000 Length:100000
1st Qu.:0.00000 Class :character 1st Qu.:0.00000 Class :character
Median :0.00000 Mode :character Median :0.00000 Mode :character
Mean :0.03943 Mean :0.03942
3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000
NA's :15
Regression-Based Imputation for Numerical Variables
Using regression models, we can estimate missing values. For the
variable, bmi, we can predict the missing values using estimates from
our model. Bmi is a continuous variable, unlike the previous example
where our values could only be 0 or 1. In this method, a linear model
was created to help create estimates of our values.
#create a dataset for regression impute
regimpute<-diabetesd
# Identify rows where BMI is missing
missing_rows <- which(is.na(diabetesd$bmi)) # Get row indices
# Train a linear model using complete cases
lm_model <- lm(bmi ~ age + blood_glucose_level + HbA1c_level + heart_disease + hypertension + diabetes,
data = regimpute, na.action = na.exclude)
# Impute missing BMI values using the model
diabetesd$bmi[missing_rows] <- predict(lm_model, newdata = regimpute[missing_rows, ])
dep.mi <- bind_rows(
data.frame(value = diabetesd$bmi, imputed = regimpute$bmi))
#imp = "dep.imp1")
i7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
i8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)
MICE Imputation
MICE is multiple imputation techniques used to impute missing values.
This method also uses regression models and accounts for uncertainty by
generating multiple possible values. Using MICE, the variables smoking
history, heart disease, and bmi, can be imputated within one function.
MICE can handle different types of variables, imputates based on the
relationship between variables and it accounts for variability in
missing data.
#plot comparisons for smoking history
g4 <- ggplot(df_imputed, aes(x = smoking_history)) +
geom_bar( fill = "purple", color = "#000000", position = "identity") +
ggtitle("MICE") +
labs(x="Smoking History")+
theme_classic()
g5 <- ggplot(value_imputed, aes(x = original2)) +
geom_bar( fill = "red", color = "#000000", position = "identity") +
ggtitle("Original Smoking History") +
labs(x="Smoking History")+
theme_classic()
g6 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
geom_bar( fill = "blue", color = "#000000", position = "identity") +
ggtitle("Mode-imputed") +
labs(x="Smoking History")+
theme_classic()
plot_grid(g4,g5,g6, nrow=3, ncol=1)
#plot comparisons for bmi
g7<-boxplot(diabetesd$bmi,
main = "Mean BMI in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g9 <-boxplot(df_imputed$bmi,
main = "MICE for BMI",
xlab = "BMI",
col = "purple",
border = "black",
horizontal = TRUE,
notch = TRUE
)
model5 <- with(df_mice, lm(diabetes ~ bmi+ heart_disease+ smoking_history ))
summary(pool(model5))
term estimate std.error statistic df
1 (Intercept) -1.589271e-01 0.0042324895 -37.549331 7809.1387
2 bmi 8.389138e-03 0.0001290303 65.016790 98218.7361
3 heart_disease1 2.215433e-01 0.0044014918 50.333683 97912.1342
4 smoking_historyever 4.340230e-03 0.0044699572 0.970978 758.3466
5 smoking_historyformer 3.921079e-02 0.0034267231 11.442648 1492.5796
6 smoking_historynever -4.314363e-05 0.0026854000 -0.016066 835.6005
7 smoking_historynot current 6.005163e-03 0.0038588408 1.556209 545.3864
p.value
1 7.985880e-284
2 0.000000e+00
3 0.000000e+00
4 3.318688e-01
5 4.026967e-29
6 9.871856e-01
7 1.202385e-01
The statistics above show the summary of the newly imputed missing
variables through MICE. Based on our p-values, smoking history will most
likely not be chosen for the final model as it’s significance level is
greater than 0.05, indicating an insignificant predictor variable.
Skewness
Skewness is a measure of symmetry within a distribution. The skewness
measurements below are of the variables age, bmi, HbA1c level, and blood
glucose level, respectively. When skewness is equal to 0, we can assume
the data is normally distributed. When skewness is greater than 0, our
data is positively skewed, while a value less than zero indicates
negatively skewed data. For the diabetes dataset, the variables bmi and
blood glucose level are 1 or about 1, which indicate a significant right
skew in the distribution.
age1<- skewness(df_imputed$age)
bmi1<-skewness(df_imputed$bmi)
HbA1c1<-skewness( df_imputed$HbA1c_level)
glucose1<-skewness(df_imputed$blood_glucose_level)
mytable<-data.frame(age1,bmi1,HbA1c1, glucose1)
pander(mytable, caption= "Skewness of Numeric Variables",position="center")
Skewness of Numeric Variables
| -0.05198 |
1.043 |
-0.06685 |
0.8216 |
Analyzing Data
The next step is to find which variables are significantly important
when trying to predict diabetes. By using feature selection, we are able
to predict possible best fit models for our variable, diabetes. For this
data, we will focus on logistic regression. The variables of interest,
diabetes, is a binary variable indicating linear regression would not
make an ideal model.
Random Forest
First, we will look at the best model selected by feature selection
through random forest:
# Prepare the data
df_trans$diabetes <- as.factor(df_trans$diabetes)
# Set up RFE control with random forest
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
# Apply RFE to identify top features
results <- rfe(
x = df_trans[, !colnames(df_trans) %in% "diabetes"],
y = df_trans$diabetes,
sizes = c(1:9),
rfeControl = control
)
# Print the selected features
models<-data.frame(predictors(results))
pander(models, caption="Random Forest Selected Model",position="center")
Random Forest Selected Model
| HbA1c_level |
| blood_glucose_level |
| bmi |
| heart_disease |
| age |
| hypertension |
| gender |
The table shows the selected variables random forest has selected to
use for our model
Stepwise Logistic Regression
Next, let’s look and stepwise logistic regression
#Stepwise Logistic Regression
# Fit a full logistic regression model with all predictors
full_model <- glm(diabetes ~ ., data = df_trans, family = binomial)
# Perform stepwise selection (default is backward elimination)
step_model1 <- step(full_model, direction = "both")
Start: AIC=23046.9
diabetes ~ gender + age + hypertension + heart_disease + smoking_history +
bmi + HbA1c_level + blood_glucose_level
Df Deviance AIC
- smoking_history 4 23028 23044
<none> 23023 23047
- gender 1 23077 23099
- heart_disease 1 23176 23198
- hypertension 1 23308 23330
- bmi 1 24361 24383
- age 1 25095 25117
- blood_glucose_level 1 30572 30594
- HbA1c_level 1 34857 34879
Step: AIC=23043.94
diabetes ~ gender + age + hypertension + heart_disease + bmi +
HbA1c_level + blood_glucose_level
Df Deviance AIC
<none> 23028 23044
+ smoking_history 4 23023 23047
- gender 1 23086 23100
- heart_disease 1 23184 23198
- hypertension 1 23314 23328
- bmi 1 24369 24383
- age 1 25191 25205
- blood_glucose_level 1 30581 30595
- HbA1c_level 1 34865 34879
pander(summary(step_model1))
| (Intercept) |
-9.99 |
0.1105 |
-90.37 |
0 |
| gender |
0.2702 |
0.03536 |
7.641 |
2.15e-14 |
| age |
1.034 |
0.02428 |
42.58 |
0 |
| hypertension |
0.8032 |
0.04665 |
17.22 |
1.977e-66 |
| heart_disease1 |
0.7626 |
0.06015 |
12.68 |
7.801e-37 |
| bmi |
10.24 |
0.2809 |
36.44 |
1.124e-290 |
| HbA1c_level |
2.491 |
0.03768 |
66.12 |
0 |
| blood_glucose_level |
10.3 |
0.149 |
69.11 |
0 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: |
58163 on 99999 degrees of freedom |
| Residual deviance: |
23028 on 99992 degrees of freedom |
The coefficients shown above are the ones stepwise logistic
regression has chosen for the final model. They are HbA1c level, blood
glucose level, bmi, hypertension, gender, and age.
Both selection types selected the same model! However I will still go
through the steps of cross-validation as though I was comparing the two
models.
Cross-Validation
Let’s explore cross-validation to see which model is best fit. For
logistic regression, it is meaningful to look at accuracy and kappa.
#add classification
df_trans$diabetes<- factor(ifelse(df_trans$diabetes==1, "Yes","No"))
train_control <- trainControl(
method = "cv", # cross-validation
number = 10, # 10-fold
classProbs = TRUE # use if you're tracking AUC, Sensitivity, etc.
)
#run CV model
cv_model <- train(
diabetes ~ hypertension + gender +age + HbA1c_level + blood_glucose_level + heart_disease + bmi,
data=df_trans,
method = "glm",
family = "binomial",
trControl = train_control
)
pander(print(cv_model))
Generalized Linear Model
1e+05 samples 7e+00 predictors 2e+00 classes: ‘No’, ‘Yes’
No pre-processing Resampling: Cross-Validated (10 fold) Summary of
sample sizes: 90000, 90000, 90000, 90000, 90000, 90000, … Resampling
results:
Accuracy Kappa
0.95934 0.7012814
Accuracy shows the overall correctness of the model predictions and
kappa measures the levels of agreement between variables. Values closer
to 1 suggest higher accuracy and better agreement. This model suggests
high accuracy and high agreement based on the values.
ROC Curve and AOC Curve
Confusion Matrices
ROC is also known as Receiver Operating
Characteristic Analysis. It is a technique using graphs that evaluates
the performance of a binary model. Before plotting the ROC curve, it is
necessary to calculate the true positive rate (TPR) and false positive
rate (FPR). Below, are confusion matrices that were used to calculate
those values.
#ROC and AUC
df_trans$diabetes <- as.factor(df_trans$diabetes)
# fit a logistic
model.logit <- glm(diabetes ~ age + gender+ bmi+hypertension + heart_disease + blood_glucose_level+ HbA1c_level , family = binomial, data = df_trans)
# predict probability of P(Y = "Yes")
probabilities <- round(as.vector(predict(model.logit, type = "response")),3)
#
thresholds <- c(0.0, 0.25, 0.5, 0.75, 1.0)
# Loop through thresholds and create confusion matrices
for (threshold in thresholds) {
cat("\nConfusion Matrix for Threshold =", threshold, "\n")
# Convert probabilities to predictions
# am: 1 = manual transmission, 0 = automatic transmission
predictions <- ifelse(probabilities > threshold, "Yes", "No")
all_levels <- union(levels(factor(df_trans$diabetes)), levels(factor(predictions)))
# Generate confusion matrix
cm <- confusionMatrix(factor(predictions, levels=all_levels), factor(df_trans$diabetes,levels=all_levels), positive = "Yes")
print(cm$table)
}
Confusion Matrix for Threshold = 0
Reference
Prediction No Yes
No 352 0
Yes 8148 91500
Confusion Matrix for Threshold = 0.25
Reference
Prediction No Yes
No 4217 144
Yes 4283 91356
Confusion Matrix for Threshold = 0.5
Reference
Prediction No Yes
No 5308 868
Yes 3192 90632
Confusion Matrix for Threshold = 0.75
Reference
Prediction No Yes
No 6322 3340
Yes 2178 88160
Confusion Matrix for Threshold = 1
Reference
Prediction No Yes
No 8500 91500
Yes 0 0
ROC Curve
#create ROC curve
TPR = c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0))
FPR = c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0))
plot(FPR, TPR, type = "b", main = "An Illustrative ROC Curve", col ="blue",
xlab="1 - Specifity (FPR)", ylab = "Sensitivity (TPR)")
# add a off-diagonal representing random guess algorithm in binary prediction
abline(0,1, lty = 2, col = "red")
# legend
legend("bottomright", c("Logistic Model", "Random Guess"),
col=c("blue", "red"), lty = 1:2, bty="n", cex = 0.9)

AUC (Area under the Curve)
AUC quantifies the performance of the ROC curve into
a single value that ranges from 0 to 1. The plot below uses the Riemann
Sum approximation to estimate the area under the curve.
#AUC
TPR = round(c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0)),3)
FPR = round(c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0)),3)
TPR0 = TPR[7:1]
FPR0 = FPR[7:1]
#plot(FPR0, TPR0, type = "b")
datSenSpe = data.frame(TPR0, FPR0)
ggROC = ggplot(data = datSenSpe, aes(x = FPR0, y=TPR0)) +
geom_line(col = "steelblue") +
geom_point(col = "red") +
geom_segment(x = FPR0, y = 0, xend = FPR0, yend = TPR0, color = 4) +
geom_segment(x = 0, y = 0, xend = FPR0[7], yend = 0, color = 6) +
ggtitle("Approximating the AUC of Logistic Model") +
xlab("1-specificity (FPR)") +
ylab("Sensitivity (TPR)") +
annotate("text", x = 0.025, y = 0.125, label= "A") +
annotate("text", x = 0.105, y = 0.5, label = "B") +
annotate("text", x = 0.605, y = 0.5, label = "C") +
theme(plot.title = element_text(hjust = 0.5),
legend.position = c(0.8, 0.2),
plot.margin = unit(c(0.15, 0.15, 0.75, 0.15), "inches"),
axis.line = element_line(size = 2, colour = "navy", linetype=1))
# partition the region under the ROC into trapezoids
ggplotly(ggROC)
AUC Value
The are under the curve is equal to 0.9795 which is close to 1, thus
indicating an ideal model.
A<-(0.041*1)/2
B<- ((0.504-0.041)*1)
C<- ((1-0.504)*1)
AUC<- A+B+C
pander(AUC, caption= "Area under the Curve", position="Center")
0.9795
Conclusion for Model Fit
Our best fit model is : \[
\text{ Diabetes} = 9.99 - 0.27\times \text{gender} - 1.03\times
\text{age} -0.8\times \text{hypertension} - 0.76\times \text{heart
disease} -10.24\times \text{bmi} - 2.49\times \text{HbA1c level}
-10.3\times \text{blood glucose level}
\]
The best fit model was selected based on a variety of different
features. Both stepwise selection and random forest selected the model
as best fit. After running performance analyses on the model, the model
was shown to be of high performance. This means that the variables
chosen for the final model are significant for predicting diabetes. The
model chose not to select smoking history as significant predictor for
the response variable. A possible suggestion for this would be to reduce
the number of response values to a binary response. Such as, “Have you
ever smoked?” and have the response be “Yes” or “No”. This could help in
predicting the relationship between smoking and diabetes.
---
title: "Applied Machine Learning for Data Analysis"
author: "Gabriella Khalil"
date: "2025-02-18"
output: 
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: no
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---


```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}
if (!require("GGally")) {
   install.packages("GGally")
library(GGally)
}
knitr::opts_chunk$set(echo = TRUE,   # include code chunk in the output file
                      warning = FALSE,# sometimes, you code may produce warning messages,
                                      # you can choose to include the warning messages in
                                      # the output file. 
                      results = TRUE, # you can also decide whether to include the output
                                      # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
library(tidyverse)
library(VIM) 
library(dplyr)
library(Hmisc)
library(ggplot2)
library(plotly)
library(DescTools)
library(cowplot)
library(mice)
library(moments)
library(tidyr)
library(reshape2)  # For data reshaping
library(caret)
library(pROC)
library(randomForest)
library(fastDummies)
library(kableExtra)
library(pander)
```
# Overview of Data

The purpose of this dataset was to help find predictors of diabetes. The goal is to help medical professionals identify patients with potential risk factors of diabetes.The dataset contains information on patient demographics such as age and gender, as well as medical information including blood glucose levels and hypertension. The observations in this dataset were obtained from research studies and healthcare institutions. The dataset was obtained via Kaggle. 

Mustafa, T. Z. (2021). *Diabetes Prediction Dataset*. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

The dataset consists of 8 predictor variables for diabetes:

* **gender** - the patient's gender as male or female.

* **age** - the age of the patient in years.

* **hypertension** - yes or no, on whether the patient has hypertension.

* **heart disease** - yes or no, on whether the patient has heart disease.

* **smoking history** - the patient's smoking history indicated as never, no info, current, former or not current.

* **bmi** - the patient's body mass index (kg/m**2).

* **HbA1c level** - the patient's average blood sugar level over the past 2-3 months (%).

* **blood glucose level** - amount of blood sugar in the patient's blood (mg/dL).



```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot of shows the frequency of patients with diabetes and hypertension. The plot indiactes a higher frequency of patients not having diabetes and not having hypertension. There does not seem to be an interaction between having diabetes and having hypertension based on the low frequency of patients having both. '}
#load in dataset
data <-read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')
diabetesd<-data

#change heart disease, diabetes and hypertension to 1 and 0 for easier use in MICE.
data$diabetes <- recode(data$diabetes, "1"="Yes", "0"="No")
data$hypertension <- recode(data$hypertension, "1"="Yes", "0"="No")
data$heart_disease <- recode(data$heart_disease, "1"="Yes", "0"="No")

#plot interaction between diabetes and hypertension.
ggplot(data, aes(x = hypertension, fill = diabetes)) +
  geom_bar(position = "dodge") + # Use "stack" for a stacked bar chart
  labs (title="Risk of Diabetes by Hypertension", x="Hypertension", y= "Frequency",
        fill= "Diabetes")

```

## Comparison of Two Categorical Variables

The purpose of this graph is to show the relationship between having hypertension and diabetes. For this graph, I changed the values of 1 in both the diabetes and hypertension variable to "Yes" and the value 0 to "No." These integers were meant to represent categorical variables with 1 correlating to being positive for hypertension or diabetes and 0 correlating to being negative to hypertension or diabetes. Based on our graph, we can see that there is no correlation between having diabetes and having hypertension, with most patients within our sample having neither diabetes or hypertension. Thus, hypertension would not be a good predictor of diabetes within patients.
```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot shows the interaction between blood glucose levels and HbA1c levels on diabetes. Based on the plot, higher blood glucose levels and HbA1c levels are positively correlated with the presence of diabetes. In comparison, lower levels, indicate that a patient will more commonly not have diabetes. The height of the points shown for diabetes represent the levels of HbA1c, based on the plot, HbA1c levels predict the presence of diabetes more often than blood glucose levels. '}


interact = ggplot(data, aes(x =blood_glucose_level , y = HbA1c_level, group = diabetes, color = diabetes)) +
  geom_line(size = 1) +  # Lines representing interaction
  geom_point(size = 2) + # Points for data
  labs(
    title = "Interaction of Blood Glucose and HbA1c Levels",
    x = "Blood Glucose Level",
    y = "HbA1c Level",
    color = "Presence of Diabetes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text = element_text(size = 12),
    legend.position = "top"
  )
interact

```
## Comparison of Two Numerical Variables

The two variables being compared here are Blood Glucose Level and HbA1c Level. HbA1cc is a shortened term for hemoglobin A1c. Hb1Ac level is a measure of average blood sugar levels over 2-3 months and is measured in %. Blood glucose level is the amount of blood sugar in a patients blood at a given time. Our plot shows a correlation between high levels in HbA1c and blood glucose levels and it's relationship with diabetes. This plot suggests patients with high levels of HbA1c and blood glucose levels are at high risk for diabetes.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The boxplot shows the relationship between age and presence of diabetes. The boxplot shows that older patients on average are more often have diabetes. There are a few outliers in the diabetes population that appear to be under the age of 20 years. This could be representative of patients with Type I diabetes, since the data does not distinguish between Type I and Type II diabetes. '}

#plot of a numerical and categorical variable
 boxplot(data$age ~ data$diabetes,
         xlab="Presence of Diabetes",
         ylab="Patient Age",
         col = c("skyblue", "purple"),
         main="Visualization of Diabetes by Age",
         cex.main = 1.1,
         col.main = "navy")
```

## Comparison of a Numerical and Categorical Variables

This chart shows the relationship between age and diabetes diagnosis. This chart shows that as patients grow older they are at higher risk for diabetes. This would make age a predictor for diabetes, with the average age for those without diabetes being 40 years and 60 years for those with diabetes. However, we can see that we do have a few outliers for those with diabetes with some patients being diagnosed at 20 or younger. This could be from the population of people with Type I diabetes which is usually diagnosed in patients under 20. Our dataset does not distinguish between type I and type II diabetes.


# Conclusion for Comparison of Variables

Diabetes can be predicted using age, blood glucose levels, and HbA1c levels in patients. However, our results showed hypertension is not a good predictor of diabetes. Based on our findings, it would be interesting to look at factors that occur once a patient becomes older that could lead to diabetes at an older age. As well, diet could be another factor to look into, such as grams of sugar consumed on a daily or weekly basis, to see if sugar consumption can predict diabetes. Another important thing to analyze in future studies would be Type I versus Type II diabetes, as it could be the reason for some of the outliers in the boxplot. 


# Overview of Missing Variables

An important step in data analysis for machine learning is handling missing values. Missing values in a dataset can occur for many reasons, but not limited to non-response in surveys, participants leaving the study, data entry errors and system limitations. If missing values are not addressed, models can become biased and accuracy of the anaylysis can decrease. This section examines different imputation strategies to effectively handle missing values.

The imputation methods we will examine include:

Replacement Imputation for Categorical Features: Mode imputation.

Regression-based Imputation for Numerical Features: Predictive modeling to estimate missing values.

Multiple Imputation: Advanced methods such as MICE to improve robustness.

```{r}
# create random observation ID and replace the corresponding obs with missing
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE) 
diabetesd$bmi[ltr.missing.id] <- NA
diabetesd$heart_disease[gpa.missing.id] <- NA
diabetesd$smoking_history[diabetesd$smoking_history == "No Info"] <- NA

```

## Mode Imputation for Categorical Variables

In this method, missing variables are replaced with the most frequent category of the corresponding variable. Mode imputation ensures consistency and works well when missing values are randomly distributed. Below, mode imputation is used on the variable, heart disease, which is 1 if the patient has heart disease and 0 if the patient does not have heart disease.The variable smoking history had current, former, no info, never, and ever as possible values. No Info indicates that we do not have information on the participant's smoking history and will be treated as a missing values.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot shows the comparison between the original data containing missing values for the variables heart disease and smoking history, and the data after using mode impute. The histograms show that mode impute used what was most common to replace the missing values. This is seen clearly in the variable smoking history, where the category never went from originally a count of about 40,000 to about 70,000. Heart disease had a small amount of missing data (only 15), however the 15 missing were changed to 0, indicating the patient would not have heart disease.', echo=FALSE}


diabetesd <-diabetesd
# Function to impute mode

Mode <- function(x) {
  ux <- unique(na.omit(x))  # Remove NAs before computing mode
  tab <- tabulate(match(x, ux))
  mode_value <- ux[which.max(tab)]
  
  # Ensure mode_value is of the same type as x
  if (is.factor(x)) {
    return(factor(mode_value, levels = levels(x)))
  } else if (is.character(x)) {
    return(as.character(mode_value))
  } else {
    return(as.numeric(mode_value))
  }
}
#apply mode imputation
value_imputed <- data.frame(
  original = diabetesd$heart_disease,
  original2=diabetesd$smoking_history,
  imputed_modehd = replace(diabetesd$heart_disease, is.na(diabetesd$heart_disease), Mode((diabetesd$heart_disease))),
  imputed_modesh = replace(diabetesd$smoking_history, is.na(diabetesd$smoking_history), Mode(diabetesd$smoking_history)))
summary(value_imputed)

#plot the comparison of original data and mode imputed data
h1 <- ggplot(value_imputed, aes(x = original)) +
  geom_bar(bins=10, fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original for Heart Disease") +
  theme_classic()
h2 <- ggplot(value_imputed, aes(x = original2)) +
  geom_bar( fill = "#15ad4f", color = "#000000", position = "identity") +
  ggtitle("Original for Smoking History") +
  theme_classic()
h3 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
  geom_bar(bins=10, fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed for Heart Disease") +
  theme_classic()
h4 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
  geom_bar( fill = "#ad8415", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed for Smoking History") +
  theme_classic()


plot_grid(h1, h2, h3, h4, nrow = 2, ncol = 2)



```

## Regression-Based Imputation for Numerical Variables

Using regression models, we can estimate missing values. For the variable, bmi, we can predict the missing values using estimates from our model. Bmi is a continuous variable, unlike the previous example where our values could only be 0 or 1. In this method, a linear model was created to help create estimates of our values. 

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The boxplots show the comparison of the variable bmi before and after regression imputation.' }

#create a dataset for regression impute
regimpute<-diabetesd

# Identify rows where BMI is missing
missing_rows <- which(is.na(diabetesd$bmi))  # Get row indices


# Train a linear model using complete cases
lm_model <- lm(bmi ~ age + blood_glucose_level + HbA1c_level + heart_disease + hypertension + diabetes, 
               data = regimpute, na.action = na.exclude)

# Impute missing BMI values using the model
diabetesd$bmi[missing_rows] <- predict(lm_model, newdata = regimpute[missing_rows, ])


dep.mi <- bind_rows(
  data.frame(value = diabetesd$bmi, imputed = regimpute$bmi))
             #imp = "dep.imp1")


i7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
i8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)


 

```

## MICE Imputation

MICE is multiple imputation techniques used to impute missing values. This method also uses regression models and accounts for uncertainty by generating multiple possible values. Using MICE, the variables smoking history, heart disease, and bmi, can be imputated within one function. MICE can handle different types of variables, imputates based on the relationship between variables and it accounts for variability in missing data. 
```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='Based on the three graphs, MICE seems to have produced values that appear almost identical to the original dataset. Mode impute seems to be slightly different since it relies on filling missing values with the most common value. In this case, mode impute filled all missing values as 0 and under-represented 1.', echo=FALSE}

#reload dataset so that it is free of mode and regression imputation done previously
data2 <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')

#create missing values and make smoking history a binary variable
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE) 
data2$smoking_history[data2$smoking_history== "No Info"] <-NA
data2$bmi[ltr.missing.id] <- NA
data2$heart_disease[gpa.missing.id] <- NA

#make variables numeric
data2$gender <- as.numeric(as.factor(data2$gender))
data2$heart_disease<- as.factor(as.numeric(data2$heart_disease))
data2$smoking_history<-as.factor(data2$smoking_history)
#apply MICE imputation
df_mice <- mice(data2, method = c("","","","pmm","polyreg","pmm","","",""),  m = 20, maxit=20,seed = 123, print=F)

# Complete dataset with imputed values
df_imputed <- complete(df_mice)

#plot comparisons of before and after MICE imputation for heart disease
g1 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
  geom_bar(fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed") +
  labs(x="Heart Disease")+
  theme_classic()
g2 <- ggplot(value_imputed, aes(x = original)) +
  geom_bar(fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original") +
  labs(x="Heart Disease")+
  theme_classic()
g3 <- ggplot(df_imputed, aes(x = heart_disease)) +
  geom_bar(fill = "purple", color = "#000000", position = "identity") +
  ggtitle("MICE") +
  labs(x="Heart Disease")+
  theme_classic()
plot_grid(g1, g2, g3, nrow=1, ncol =3)
```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap= 'The bar graphs show the comparison of the original data to mode imputation and MICE. MICE appears to have dispersed the data more evenly across the different categories, while mode imputation has only added values to the category never.'}
#plot comparisons for smoking history
g4 <- ggplot(df_imputed, aes(x = smoking_history)) +
  geom_bar( fill = "purple", color = "#000000", position = "identity") +
  ggtitle("MICE") +
  labs(x="Smoking History")+
  theme_classic()
g5 <- ggplot(value_imputed, aes(x = original2)) +
  geom_bar( fill = "red", color = "#000000", position = "identity") +
  ggtitle("Original Smoking History") +
  labs(x="Smoking History")+
  theme_classic()
g6 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
  geom_bar( fill = "blue", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed") +
  labs(x="Smoking History")+
  theme_classic()
plot_grid(g4,g5,g6, nrow=3, ncol=1)
```
```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The boxplots show the comparison of the original data to both MICE and regression imputation. The appear to have similar results, based on mean which is indicated by the black line within each box.'}

#plot comparisons for bmi
g7<-boxplot(diabetesd$bmi,
main = "Mean BMI in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g9 <-boxplot(df_imputed$bmi,
main = "MICE for BMI",
xlab = "BMI",
col = "purple",
border = "black",
horizontal = TRUE,
notch = TRUE
)



```

```{r fig.align='center', fig.width=7, fig.height=5}
model5 <- with(df_mice, lm(diabetes ~ bmi+ heart_disease+ smoking_history ))
summary(pool(model5)) 


```

The statistics above show the summary of the newly imputed missing variables through MICE. Based on our p-values, smoking history will most likely not be chosen for the final model as it's significance level is greater than 0.05, indicating an insignificant predictor variable. 


# Skewness

Skewness is a measure of symmetry within a distribution. The skewness measurements below are of the variables age, bmi, HbA1c level, and blood glucose level, respectively. When skewness is equal to 0, we can assume the data is normally distributed. When skewness is greater than 0, our data is positively skewed, while a value less than zero indicates negatively skewed data. For the diabetes dataset, the variables bmi and blood glucose level are 1 or about 1, which indicate a significant right skew in the distribution. 
```{r}



age1<- skewness(df_imputed$age) 
bmi1<-skewness(df_imputed$bmi) 
HbA1c1<-skewness( df_imputed$HbA1c_level) 
glucose1<-skewness(df_imputed$blood_glucose_level)

mytable<-data.frame(age1,bmi1,HbA1c1, glucose1)
pander(mytable, caption= "Skewness of Numeric Variables",position="center")

```




# Transforming Data

To handle our skewed variables, we can normalize the variables bmi and blood glucose level. Normalization rescales data into a fixed range of [0,1]. By normalizing our data, we can fix skewness and create normally distributed data. The variables, age and HbA1c level have been standardized to keep the distribution centered. Standardization is used to create a mean of 0 and a standard deviation of 1, to follow a normal distribution. Based on the results below, we can see that the transformation has corrected the right skew in the data from before.


```{r}
df_trans <- df_imputed  # Copy original dataset after MICE

#ensure diabetes is numeric and a factor
df_trans$diabetes<- as.numeric(as.factor(df_trans$diabetes))
# Min-Max Normalization
normalize <- function(x) {
  if (max(x) - min(x) == 0) return(rep(0, length(x)))  # Prevent division by zero
  return((x - min(x)) / (max(x) - min(x)))
}

#apply normalization
df_trans$bmi <- normalize(df_imputed$bmi)
df_trans$blood_glucose_level <- normalize(df_imputed$blood_glucose_level)

# Z-Score Standardization
standardize <- function(x) {
  return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}

#apply standardization
df_trans$age <- standardize(df_trans$age)
df_trans$HbA1c_level <- standardize(df_trans$HbA1c_level)

# Apply Log Transformation to Fix Skewness (if necessary)
df_trans$bmi <- log1p(df_trans$bmi)  # log(1 + x) avoids log(0) issues
df_trans$blood_glucose_level <- log1p(df_trans$blood_glucose_level)

bmi2 <- skewness(df_trans$bmi)
age2 <- skewness(df_trans$age)
HbA1c2<- skewness(df_trans$HbA1c_level)
glucose2<- skewness(df_trans$blood_glucose_level)

newtable<-data.frame(bmi2,age2,HbA1c2,glucose2)
pander(newtable, caption="Newly Transformed Data", position="center")


```

Transforming our data has successfully decreased skewness in our variables. We can see bmi and blood glucose level are no longer about 1. 

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='We can see in the graphs above that after transforming the dataset, the data seems to follow a normal distribution more closely.This is apparent in the histograms appearing more in a bell shape curve.' }

# Combine old and new data for comparison
df_compare <- data.frame(
  bmi_original = df_imputed$bmi,
  bmi_transformed = df_trans$bmi,
  blood_glucose_original = df_imputed$blood_glucose_level,
  blood_glucose_transformed = df_trans$blood_glucose_level
)

# Reshape data for ggplot
df_long <- melt(df_compare)

# Plot histograms
ggplot(df_long, aes(x = value, fill = variable)) +
  geom_histogram(alpha = 0.6, bins = 30, position = "identity") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(title = "Comparison of Original vs. Transformed Data", x = "Value", y = "Count")


```




# Analyzing Data
The next step is to find which variables are significantly important when trying to predict diabetes. By using feature selection, we are able to predict possible best fit models for our variable, diabetes. For this data, we will focus on logistic regression. The variables of interest, diabetes, is a binary variable indicating linear regression would not make an ideal model. 



## Random Forest
First, we will look at the best model selected by feature selection through random forest:
```{r}

# Prepare the data
df_trans$diabetes <- as.factor(df_trans$diabetes)


# Set up RFE control with random forest
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# Apply RFE to identify top features
results <- rfe(
  x = df_trans[, !colnames(df_trans) %in% "diabetes"],
  y = df_trans$diabetes,
  sizes = c(1:9),
  rfeControl = control
)

# Print the selected features
models<-data.frame(predictors(results))
pander(models, caption="Random Forest Selected Model",position="center")


```

The table shows the selected variables random forest has selected to use for our model


## Stepwise Logistic Regression
Next, let's look and stepwise logistic regression

```{r}
#Stepwise Logistic Regression
# Fit a full logistic regression model with all predictors
full_model <- glm(diabetes ~ ., data = df_trans, family = binomial)

# Perform stepwise selection (default is backward elimination)
step_model1 <- step(full_model, direction = "both")
pander(summary(step_model1))

```

The coefficients shown above are the ones stepwise logistic regression has chosen for the final model. They are HbA1c level, blood glucose level, bmi, hypertension, gender, and age. 

Both selection types selected the same model! However I will still go through the steps of cross-validation as though I was comparing the two models.

# Cross-Validation

Let's explore cross-validation to see which model is best fit. For logistic regression, it is meaningful to look at accuracy and kappa. 
```{r}

#add classification
df_trans$diabetes<- factor(ifelse(df_trans$diabetes==1, "Yes","No"))

train_control <- trainControl(
  method = "cv",         # cross-validation
  number = 10,           # 10-fold
  classProbs = TRUE # use if you're tracking AUC, Sensitivity, etc.
)

#run CV model
cv_model <- train(
  diabetes ~ hypertension + gender +age + HbA1c_level + blood_glucose_level + heart_disease + bmi, 
  data=df_trans, 
  method = "glm", 
  family = "binomial", 
  trControl = train_control
  )

pander(print(cv_model))

```

Accuracy shows the overall correctness of the model predictions and kappa measures the levels of agreement between variables. Values closer to 1 suggest higher accuracy and better agreement. This model suggests high accuracy and high agreement based on the values.

# ROC Curve and AOC Curve

## Confusion Matrices

**ROC** is also known as Receiver Operating Characteristic Analysis. It is a technique using graphs that evaluates the performance of a binary model. Before plotting the ROC curve, it is necessary to calculate the true positive rate (TPR) and false positive rate (FPR). Below, are confusion matrices that were used to calculate those values.
```{r}

#ROC and AUC
df_trans$diabetes <- as.factor(df_trans$diabetes)

# fit a logistic
model.logit <- glm(diabetes ~ age + gender+ bmi+hypertension + heart_disease + blood_glucose_level+ HbA1c_level , family = binomial, data = df_trans)
# predict probability of P(Y = "Yes")
probabilities <- round(as.vector(predict(model.logit, type = "response")),3)
#
thresholds <- c(0.0, 0.25, 0.5, 0.75, 1.0)

# Loop through thresholds and create confusion matrices
for (threshold in thresholds) {
  cat("\nConfusion Matrix for Threshold =", threshold, "\n")
  
  # Convert probabilities to predictions
  # am: 1 = manual transmission, 0 = automatic transmission
  predictions <- ifelse(probabilities > threshold, "Yes", "No")
    all_levels <- union(levels(factor(df_trans$diabetes)), levels(factor(predictions)))
    # Generate confusion matrix
  cm <- confusionMatrix(factor(predictions, levels=all_levels), factor(df_trans$diabetes,levels=all_levels), positive = "Yes")
  print(cm$table)
}

```



## ROC Curve
```{r fig.align='center', fig.width=7, fig.height=5,'The plot shows the path of the ROC curve.'}

#create ROC curve

TPR = c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0))
FPR = c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0))
plot(FPR, TPR, type = "b", main = "An Illustrative ROC Curve", col ="blue",
     xlab="1 - Specifity (FPR)", ylab = "Sensitivity (TPR)")
# add a off-diagonal representing random guess algorithm in binary prediction
abline(0,1, lty = 2, col = "red")
# legend
legend("bottomright", c("Logistic Model", "Random Guess"),
       col=c("blue", "red"), lty = 1:2, bty="n", cex = 0.9)
```




## AUC (Area under the Curve)

**AUC** quantifies the performance of the ROC curve into a single value that ranges from 0 to 1. The plot below uses the Riemann Sum approximation to estimate the area under the curve.
```{r fig.align='center', fig.width=7, fig.height=5,'The plot estimates the area of the ROC Curve. The region is divided into subregions, A,B, and C. These values are used to calculate the area under the curve.' }
#AUC
TPR = round(c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0)),3)
FPR = round(c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0)),3)
TPR0 = TPR[7:1]
FPR0 = FPR[7:1]
#plot(FPR0, TPR0, type = "b")
datSenSpe = data.frame(TPR0, FPR0)
ggROC = ggplot(data = datSenSpe, aes(x = FPR0, y=TPR0)) +
        geom_line(col = "steelblue") +
        geom_point(col = "red") +
        geom_segment(x = FPR0, y = 0, xend = FPR0, yend = TPR0, color = 4) +
        geom_segment(x = 0, y = 0, xend = FPR0[7], yend = 0, color = 6) +
        ggtitle("Approximating the AUC of Logistic Model") +
        xlab("1-specificity (FPR)") + 
        ylab("Sensitivity (TPR)") +
        annotate("text", x = 0.025, y = 0.125, label= "A") + 
        annotate("text", x = 0.105, y = 0.5, label = "B") +
        annotate("text", x = 0.605, y = 0.5, label = "C") +
        theme(plot.title = element_text(hjust = 0.5),
              legend.position = c(0.8, 0.2),
              plot.margin = unit(c(0.15, 0.15, 0.75, 0.15), "inches"),
              axis.line = element_line(size = 2, colour = "navy", linetype=1))
# partition the region under the ROC into trapezoids
ggplotly(ggROC)
```

## AUC Value

The are under the curve is equal to 0.9795 which is close to 1, thus indicating an ideal model.
```{r fig.align='center', fig.width=7, fig.height=5,'The area under the curve is 0.9795 which is close to 1, this indicates an ideal model.'}
A<-(0.041*1)/2
B<- ((0.504-0.041)*1)
C<- ((1-0.504)*1)

AUC<- A+B+C
pander(AUC, caption= "Area under the Curve", position="Center")

```

# Conclusion for Model Fit
Our best fit model is :
$$
\text{ Diabetes} = 9.99 - 0.27\times \text{gender} - 1.03\times \text{age}  -0.8\times \text{hypertension} - 0.76\times \text{heart disease} -10.24\times \text{bmi} - 2.49\times \text{HbA1c level} -10.3\times \text{blood glucose level}
$$


The best fit model was selected based on a variety of different features. Both stepwise selection and random forest selected the model as best fit. After running performance analyses on the model, the model was shown to be of high performance. This means that the variables chosen for the final model are significant for predicting diabetes. The model chose not to select smoking history as significant predictor for the response variable. A possible suggestion for this would be to reduce the number of response values to a binary response. Such as, "Have you ever smoked?" and have the response be "Yes" or "No". This could help in predicting the relationship between smoking and diabetes. 
